Hi guys, Welcome to this tutorial. In this tutorial, we will learn to how to present your idea or suggestions to your employer by analysing the data provided to you. Our Suggestions and/or comments will be data driven and, and we hope that our analysis helps our employer in some way.

Problem/Questions

The employes gives us this data and asks us to use our analytical skills to draw insights.

1. What can be done to increase the Ridership in company’s taxis in NY?

2. Should the company sack the license of vendor 1 because of poor performance(in comparison to other vendor’s) as well as not completing the target(either in total ridership or overall journey)?

3. What sort of people should the company target specifically, and what new schemes should the company start?

4. What can be done to increase efficiency and productivity of present vendors?

5. Draw some other insights from this data that can accelerate the growth of the company?

6. Predictions for for future trip durations for this year

We will find(though not in the same sequence) the answer of these questions systematically and will try to help the company to work around some new strategies, also, apart from answering the above aforemented questions i would dive deep into the data and make some visualizations that would help us in our machine learning process.

Note - All(except some) the plots in this notebook are interactive, so i encourage you to zoom in and hover your cursor to gain detail information out of every plot

Data

#Datasets
train <- fread("train.csv")
test <- fread("test.csv")

train[, filter:= 0]
test[, filter:= 1]

#combined dataset
dataset <- bind_rows(train, test)

We have train dataset and test dataset. In this notebook, we would do our analysis primarily on the train dataset, though we would use combined dataset(train and test) sometimes. The test dataset doesn’t have drop off time and trip duration(target variable) information, that means that when using these variables, we would have to only use train dataset.

Missing data

Before starting with our analysis, we must find out the “actual amount of data”. Before starting, it is really important to look for any missing values in the provided dataset and make sure that we take the effect of those rows into account while doing our analysis.

colSums(is.na(train))
##                 id          vendor_id    pickup_datetime 
##                  0                  0                  0 
##   dropoff_datetime    passenger_count   pickup_longitude 
##                  0                  0                  0 
##    pickup_latitude  dropoff_longitude   dropoff_latitude 
##                  0                  0                  0 
## store_and_fwd_flag      trip_duration             filter 
##                  0                  0                  0
colSums(is.na(test))
##                 id          vendor_id    pickup_datetime 
##                  0                  0                  0 
##    passenger_count   pickup_longitude    pickup_latitude 
##                  0                  0                  0 
##  dropoff_longitude   dropoff_latitude store_and_fwd_flag 
##                  0                  0                  0 
##             filter 
##                  0

No missing values in train and test dataset. We have complete dataset to start our analysis.

Summary

When starting with your analysis, always, i repeat always start with the summary and the structure of the data. This step allows us to know the center of the distribution(if continous) and/or levels(if categorical) of each variable.

summary(dataset)
##       id              vendor_id     pickup_datetime    dropoff_datetime  
##  Length:2083778     Min.   :1.000   Length:2083778     Length:2083778    
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :1.535                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##                                                                          
##  passenger_count pickup_longitude  pickup_latitude dropoff_longitude
##  Min.   :0.000   Min.   :-121.93   Min.   :34.36   Min.   :-121.93  
##  1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99  
##  Median :1.000   Median : -73.98   Median :40.75   Median : -73.98  
##  Mean   :1.664   Mean   : -73.97   Mean   :40.75   Mean   : -73.97  
##  3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96  
##  Max.   :9.000   Max.   : -61.34   Max.   :51.88   Max.   : -61.34  
##                                                                     
##  dropoff_latitude store_and_fwd_flag trip_duration         filter   
##  Min.   :32.18    Length:2083778     Min.   :      1   Min.   :0.0  
##  1st Qu.:40.74    Class :character   1st Qu.:    397   1st Qu.:0.0  
##  Median :40.75    Mode  :character   Median :    662   Median :0.0  
##  Mean   :40.75                       Mean   :    959   Mean   :0.3  
##  3rd Qu.:40.77                       3rd Qu.:   1075   3rd Qu.:1.0  
##  Max.   :48.86                       Max.   :3526282   Max.   :1.0  
##                                      NA's   :625134

Take a look at the max and min of the pickup_latitude/longitude and dropoff_latitude/longitude. The drop off point doesn’t always have to be inside New York city, for people may wish to go out of city on cab, but the pickup point should always be in the vicinity of New York city. The max and min of pickup_latitude/longitude suggests that we may have some mistakes in the dataset. We analyse this mistake later and we start our analysis with the data as we have.

Distribution of Rides

We look at the distribution of rides from different periods and intervals, to start with, and get the idea of overall rise and fall in ridership numbers during differet periods.

  • Distribution of rides is uniform.

  • No dramatic rise or fall of ridership numbers, except the midnight period.

  • Many assumptions can be made on these four.

The distribution of rides in different periods is nearly always uniform. Though the monthly, weekly and daily distibutions are uniform, the hourly distribution shows stark drop in rides during midnight. Ofcourse, people want to sleep at night, but keeping in mind the nightlife of NY, this fall in rides should be closely investigated,this and the other three distribution tells whole lot of things about the company’s strategy, nature of riders, and so on, about which i talk later in this notebook.

#extract day, hour, weekday and year from datetime
train[,":="  (pickup_datetime = ymd_hms(pickup_datetime),
              dropoff_datetime = ymd_hms(dropoff_datetime),
              pick_year = year(pickup_datetime),
              pick_month = month(pickup_datetime),
              pick_day = day(pickup_datetime),
              pick_wday = wday(pickup_datetime),
              pick_hour = hour(pickup_datetime))]

Monthly

hchart(density(train$pick_month),type = "area", color = "#B71C1C", name = "monthly distributin of rides") %>% hc_xAxis(title = list(text = "Month")) %>% hc_add_theme(hc_theme_ffx())

Weekly

hchart(density(train$pick_wday),type = "area", color = "#c1c1c9", name = "weekly distributin of rides") %>% hc_add_theme(hc_theme_ffx())

Daily

hchart(density(train$pick_day), type = "area", color = "#f08080", name = "daily distributin of rides") %>% hc_add_theme(hc_theme_ffx())

Hourly

hchart(density(train$pick_hour), type = "area", color = "#c8cbcc",name = "hourly distributin of rides") %>% hc_add_theme(hc_theme_ffx())

Now, let us focus on answering the second question.

Ridership Numbers

Count

hchart(train$vendor_id, type = "column", colorByPoint = T) %>% hc_colors(c("red", "blue")) %>% hc_chart(type = "column",
           options3d = list(enabled = TRUE, beta = 15, alpha = 15)) %>% hc_title(text = "Ridership numbers of each Vendor") %>% hc_subtitle(text = "The vendor 2 has 200000+ more trips as compared to vendor 1") %>% hc_add_theme(hc_theme_darkunica())

We look at the rides done by each vendor in percentage. Percentages are easy to interpret.

Percentage

train %>% group_by(vendor_id) %>% summarise(Count = n()) %>% mutate(percent = Count/sum(Count)) %>% plot_ly(values = ~percent, labels = ~vendor_id, type = "pie") %>% layout(showlegend = T, title = "<b>Ride percentage of both vendors</b>", xaxis = list(showgrid = F, zeroline = F, showline = F, showticklabels = F), yaxis = list(showgrid = F, zeroline = F, showline = F, showticklabels = F))
  • vendor 1 has less rides than vendor 2
  • vendor 1 has 8% less rides than vendor 2.

As can be seen from the plots, vendor 1 performs poorly, relative to vendor 2’s figures, in terms of ridership. The gap of 8% between the two is HUGE, trust me. Company’s total ridership target can be highly affected because of this difference between the two. The reason behind vendor 1’s poor performance could be several. Less number of taxis, less number of employees, as compared to vendor 2. We will look at some of the parameters that might tell us, why vendor 1 lags behind vendor 2

Trip time

As we know that trip duration is highly correlated with the fare, this information, alongwith the distance covered by the taxi, forms the important part of the company’s strategy to generate money.

Overall trip duration

ggplotly(train %>%
     ggplot(aes(trip_duration, fill = as.factor(vendor_id))) +
     geom_density(position = "stack") +
     scale_x_log10())

Once again we see vendor 1 is well behind vendor 2 in overall journey time. The density plot shows us the trip duration of 6 months as a whole, to get the idea of how a particular vendor performs at a given point of time. We will have to look at line plots.

x <- train
x$date <- as.Date(x$pickup_datetime)
f <- x %>% sample_frac(0.003, replace = F)
plot_ly(f, x = ~date) %>%
  add_lines(data = f %>% filter(vendor_id == 1), y = ~log(trip_duration), name = "vendor 1") %>%
  add_lines(data = f %>% filter(vendor_id == 2), y = ~log(trip_duration), name = "vendor 2") %>%
  layout(
    title = "Distribution of trip duration",
    xaxis = list(
      rangeselector = list(
        buttons = list(
          list(
            count = 3,
            label = "3 mo",
            step = "month",
            stepmode = "backward"),
          list(
            count = 6,
            label = "6 mo",
            step = "month",
            stepmode = "backward"),
          list(
            count = 1,
            label = "YTD",
            step = "year",
            stepmode = "todate"),
          list(step = "all"))),
      
      rangeslider = list(type = "date")),
    
    yaxis = list(title = "trip duration(log)"))
  • Vendor 1 does give some competition to vendor 2 in trip duration at some intervals.

  • vendor 2’s distribution contains several huge peaks, huge, because we are measuring the trip duration in log scale.

Mean Trip duration in different time segments

Here we pay attention to mean trip duration in different time segments from both the vendors, individually

  • vendor 1 never overtakes vendor 2 in total trip duration at any time segment, though
  • hourly distribution shows that vendor 1 comes close to overtaking vendor 2 at 6 a.m.
  • The company, for now, is now just focusing on capturing the market, as is always the case with any company. Giving this information in defence of vendor 1 isn’t going to work.

Once again, these plots show that vendor 1 is quite behind vendor 2 in total journey time. The plots show us that, at midnight, vendor1 makes high trip durations than vendor 2. Also do remember, from an earlier exploration, this was the same period that showed dramatic decline in total(from vendor1 and vendor2 combined) rides. Now, let us investigate that.

Monthly

stats <- train %>% group_by(pick_month, vendor_id) %>% summarise(trip = mean(trip_duration))
v1 <- stats %>% filter(vendor_id   == 1) %>% select(-vendor_id) %>% rename(vendor1 = trip)

v2 <- stats %>% filter(vendor_id   == 2) %>% select(-vendor_id) %>% rename(vendor2 = trip)

stats <- left_join(v1, v2, by = "pick_month") %>% as.data.frame()

stats %>% plot_ly(x = ~pick_month) %>%
  add_trace(y = ~vendor1, name = "vendor1", type = "scatter", mode ="markers+lines") %>%
  add_trace(y = ~vendor2, name = "vendor2",type = "scatter", mode = "markers+lines", visible = F) %>%
  layout(
    title = "Mean monthly trip duration",
    xaxis = list(domain = c(0.1, 1)),
    yaxis = list(title = "trip duration"),
    updatemenus = list(
      list(
        y = 0.8,
        buttons = list(
          
          list(method = "restyle",
               args = list("line.color", "blue"),
               label = "Blue"),
          
          list(method = "restyle",
               args = list("line.color", "red"),
               label = "Red"))),
      
      list(
        y = 0.7,
        buttons = list(
          list(method = "restyle",
               args = list("visible", list(TRUE, FALSE)),
               label = "vendor1"),
          
          list(method = "restyle",
               args = list("visible", list(FALSE, TRUE)),
               label = "vendor2")))
    )
)

Weekly

stats <- train %>% group_by(pick_wday, vendor_id) %>% summarise(trip = mean(trip_duration))
v1 <- stats %>% filter(vendor_id   == 1) %>% select(-vendor_id) %>% rename(vendor1 = trip)

v2 <- stats %>% filter(vendor_id   == 2) %>% select(-vendor_id) %>% rename(vendor2 = trip)

stats <- left_join(v1, v2, by = "pick_wday") %>% as.data.frame()

stats %>% plot_ly(x = ~pick_wday) %>%
  add_trace(y = ~vendor1, name = "vendor1", type = "scatter", mode ="markers+lines") %>%
  add_trace(y = ~vendor2, name = "vendor2",type = "scatter", mode = "markers+lines", visible = F) %>%
  layout(
    title = "Mean weekly trip duration",
    xaxis = list(domain = c(0.1, 1)),
    yaxis = list(title = "trip duration"),
    updatemenus = list(
      list(
        y = 0.8,
        buttons = list(
          
          list(method = "restyle",
               args = list("line.color", "blue"),
               label = "Blue"),
          
          list(method = "restyle",
               args = list("line.color", "red"),
               label = "Red"))),
      
      list(
        y = 0.7,
        buttons = list(
          list(method = "restyle",
               args = list("visible", list(TRUE, FALSE)),
               label = "vendor1"),
          
          list(method = "restyle",
               args = list("visible", list(FALSE, TRUE)),
               label = "vendor2")))
    )
)

Daily

stats <- train %>% group_by(pick_day, vendor_id) %>% summarise(trip = mean(trip_duration))
v1 <- stats %>% filter(vendor_id   == 1) %>% select(-vendor_id) %>% rename(vendor1 = trip)

v2 <- stats %>% filter(vendor_id   == 2) %>% select(-vendor_id) %>% rename(vendor2 = trip)

stats <- left_join(v1, v2, by = "pick_day") %>% as.data.frame()

stats %>% plot_ly(x = ~pick_day) %>%
  add_trace(y = ~vendor1, name = "vendor1", type = "scatter", mode ="markers+lines") %>%
  add_trace(y = ~vendor2, name = "vendor2",type = "scatter", mode = "markers+lines", visible = F) %>%
  layout(
    title = "Mean daily trip duration",
    xaxis = list(domain = c(0.1, 1)),
    yaxis = list(title = "trip duration"),
    updatemenus = list(
      list(
        y = 0.8,
        buttons = list(
          
          list(method = "restyle",
               args = list("line.color", "blue"),
               label = "Blue"),
          
          list(method = "restyle",
               args = list("line.color", "red"),
               label = "Red"))),
      
      list(
        y = 0.7,
        buttons = list(
          list(method = "restyle",
               args = list("visible", list(TRUE, FALSE)),
               label = "vendor1"),
          
          list(method = "restyle",
               args = list("visible", list(FALSE, TRUE)),
               label = "vendor2")))
    )
)

Hourly

stats <- train %>% group_by(pick_hour, vendor_id) %>% summarise(trip = mean(trip_duration))

v1 <- stats %>% filter(vendor_id   == 1) %>% select(-vendor_id) %>% rename(vendor1 = trip)

v2 <- stats %>% filter(vendor_id   == 2) %>% select(-vendor_id) %>% rename(vendor2 = trip)

stats <- left_join(v1, v2, by = "pick_hour") %>% as.data.frame()

stats %>% plot_ly(x = ~pick_hour) %>%
  add_trace(y = ~vendor1, name = "vendor1", type = "scatter", mode ="markers+lines") %>%
  add_trace(y = ~vendor2, name = "vendor2",type = "scatter", mode = "markers+lines", visible = F) %>%
  layout(
    title = "Mean hourly trip duration",
    xaxis = list(domain = c(0.1, 1)),
    yaxis = list(title = "trip duration"),
    updatemenus = list(
      list(
        y = 0.8,
        buttons = list(
          
          list(method = "restyle",
               args = list("line.color", "blue"),
               label = "Blue"),
          
          list(method = "restyle",
               args = list("line.color", "red"),
               label = "Red"))),
      
      list(
        y = 0.7,
        buttons = list(
          list(method = "restyle",
               args = list("visible", list(TRUE, FALSE)),
               label = "vendor1"),
          
          list(method = "restyle",
               args = list("visible", list(FALSE, TRUE)),
               label = "vendor2")))
    )
)

The quite period

We know, from an earlier visualisations, the number of rides drop dramatically around 4 a.m to 7 a.m at midnight. We do the exploration of this period, try to understand the data that comes under this period.

sub <- train %>% filter(pick_hour %in% c(4, 5, 6))

sub %>% group_by(vendor_id) %>% summarise(count = n()) %>% mutate(percent = count/sum(count)) %>% plot_ly(values = ~percent, labels = ~vendor_id, type="pie") %>% layout(showlegend = T, title = "<b>Ride percentage of both vendors</b>", xaxis = list(showgrid = F, zeroline = F, showline = F, showticklabels = F), yaxis = list(showgrid = F, zeroline = F, showline = F, showticklabels = F))

Wow, look at the ride percentage of vendor 1 at midnight. vendor 1 does quite a good job in bringing the ride share to 50:50. This is quite a good achievement for vendor 1.

Suggestion

  1. Company shouldn’t withdraw the license of vendor1, if possible.
  2. withdrawing the license may not only hit the company’s short-term ridership numbers but also the company may loose many of the midnight rides where vendor1’s future prospects look promising
  3. Still, company should ask for reasons of vendor1’s incapability to meet the target given to him and then take the necessary step to resolve those issues.

Now, let us ask ourselves a question. Why and were does actually vendor 1 lag behind vendor 2? because of which he is nowhere close to vendor 2 either in ridership numbers or in total trip durations.

This, i try to answer at the end of this notebook, so stay till the end.

Feature(pickup and dropoff point)

Now, we create two new features to look for destinations that attract most and least taxi ridership, these would be pickup point and dropoff point. These features aren’t derived by some fancy feature engineering techniques, and are simply just a combination of latitude and longitude.

#dropoff and pickup point
train[, pickup_point := paste(pickup_latitude, pickup_longitude)]
train[, dropoff_point:= paste(dropoff_latitude, dropoff_longitude)]

Bottom 50 destinations

  • Pickup points and dropoff points look similar, though pickup points look little bit clustered.
  • Some pickups and dropoffs appear as mistakes and are located inside the ocean
  • Dropoff points and pickup points are far away from NY.

What? pickup and dropoff point inside the ocean? Well, i don’t know about you but i am assuming that driver on his way had a sudden mood swing and just drove into the ocean for a swim.

Haha, now let’s not laugh on this moronic joke and focus on the other pickups as the pickup and dropoff points inside the ocean are surely mistakes.

The Dropoff points, except the ones inside the ocean seems genuine, remember that the dropoffs can be outside NY but not the pickups, but we do have some pickup points near to the dropoff points which are outside the NY city.

What does this suggest? This suggests that the taxi on his way back to the NY city might have got some passengers and they might be heading to NY city. The clusters in the pickups also tell us that the driver may have got 5 to 6 passengers who headed to the ney york city or they may be headed to area around the dropoff point. We will investigate this further, surely.

Pickup

g <- train %>% group_by(pickup_latitude, pickup_longitude) %>% summarise(count = n())

g <- g[with(g, order(count)),]

col_pickup <- colorNumeric(topo.colors(200), g$count[1:50])
leaflet(g[1:50,]) %>% addTiles() %>% addMarkers(~pickup_longitude, ~pickup_latitude, popup = ~paste(pickup_latitude, pickup_longitude)) %>% addLegend("bottomright", colors = "blue", labels = "pickups", title = "50 least popular pickup points", opacity = 0.3)

Dropoff

g <- train %>% group_by(dropoff_latitude, dropoff_longitude) %>% summarise(count = n())

g <- g[with(g, order(count)),]

col_dropoff <- colorNumeric(topo.colors(200), g$count[1:50])
leaflet(g[1:50,]) %>% addTiles() %>% addMarkers(~dropoff_longitude, ~dropoff_latitude, popup = ~paste(dropoff_latitude, dropoff_longitude, sep = ",")) %>% addLegend("bottomright",colors = "blue", labels = "dropoffs", title = "50 least popular dropoff points", opacity = 0.3)

Top 50 destinations

  • Pickup and dropoff points appear as clusters, a good sign for the company.

  • We see three main clusters of pickup points, 1st cluster is situated at the La guardia airport, and can be seen without zooming, 2nd cluster is at JF kennedy airport, and can be seen after going down the La guardia airport and zooming a little bit, and 3rd cluster appears at the New York Penn Station, where the Madison Square Garden is situated, and can be seen by little bit of zooming.

  • We see three main dropoff points and a possible forth main drop off points. First main dropoff point is the La guardia airport. Second we see at the New York Penn Station and slightly away from it at the Manhattan Community Board. Third we see at the 12th avenue.

  • The second and third dropoff points are very close to city bike station. This is understandable because NY has the highest car-free households, that stands at 56% and New Yorkers prefer highly prefer to use public transport or like to ride a bicyclye to work.

  • fourth possible main dropoff point can be the NewYork’s Presbyterian Hospital and shows only one ride in this map, but if more points are plotted, we may surely see a cluster of dropoffs forming around this destination.

These two plots give a lot of information to us and make us capable of answering the 1st and the 3rd question.

You can infer by the above plots that no coordinate has counts above 35, this is really surprising because the pickup points we are talking about includes two most important and highly lucrative(in terms of ridership as well as money) pickup points for any ride hailing company, the airports. Though we know that several of the coordinates point to the same destination, with slight deviations in latitude or longitude, and even if we consider the total pickups at both the Airport at around 1000 each, during this 6 months, the ridership at the airports are still astonishingly low for the company in a city like New York, this, after we know that the train dataset is from Jan of 2016 and includes New Year as well as christmas season, till the June of 2016. Company needs to look in to this matter

One more thing to note is that the actual number of pickup or dropoff destinations can be narrowed to some number because as you can see from the above plot, the coordinates point to only one destination but with slight deviations in latitudes or longitudes. So this information also tells us that it will be good to run clustering algos on pickup and dropoff coordinates. Let us try to visualize this and see what we get.

Pickup

g <- train %>% group_by(pickup_latitude, pickup_longitude) %>% summarise(count = n()) %>% arrange(desc(count))

col_pickup <- colorNumeric(topo.colors(100), g$count[1:50])
leaflet(g[1:50,]) %>% addTiles() %>% addCircles(~pickup_longitude, ~pickup_latitude,popup = ~paste(pickup_longitude, pickup_latitude), color= ~col_pickup(count)) %>% addLegend("bottomright",pal = col_pickup, values= g$count[1:50], title = "50 most popular pickup points", opacity = 0.3)

Dropoff

g <- train %>% group_by(dropoff_latitude, dropoff_longitude) %>% summarise(count = n()) %>% arrange(desc(count))

col_dropoff <- colorNumeric(topo.colors(100), g$count[1:50])
leaflet(g[1:50,]) %>% addTiles() %>% addCircles(~dropoff_longitude, ~dropoff_latitude,popup = ~paste(dropoff_longitude, dropoff_latitude), color= ~col_dropoff(count)) %>% addLegend("bottomright",pal = col_dropoff, values= g$count[1:50], title = "50 most popular pickup points", opacity = 0.3)

Possible duplicates

Here, we look at the coordinates that points to the same direction and just deviate by few angles, either latitude or longitude.

fixtable <- function(...) {
  tab <- table(...)
  if (substr(colnames(tab)[1],1,1) == "_" &
      substr(rownames(tab)[1],1,1) == "_") {
    tab2 <- tab
    colnames(tab2) <- sapply(strsplit(colnames(tab2), split=" "), `[`, 1)
    rownames(tab2) <- sapply(strsplit(rownames(tab2), split=" "), `[`, 1)
    tab2[1,1] <- 0
    # mandat w klubie
    for (par in names(which(tab2[1,] > 0))) {
      delta = min(tab2[par, 1], tab2[1, par])
      tab2[par, par] = tab2[par, par] + delta
      tab2[1, par] = tab2[1, par] - delta
      tab2[par, 1] = tab2[par, 1] - delta
    }
    # przechodzi przez niezalezy
    for (par in names(which(tab2[1,] > 0))) {
      tab2["niez.", par] = tab2["niez.", par] + tab2[1, par]
      tab2[1, par] = 0
    }
    for (par in names(which(tab2[,1] > 0))) {
      tab2[par, "niez."] = tab2[par, "niez."] + tab2[par, 1]
      tab2[par, 1] = 0
    }
    
    tab[] <- tab2[] 
  }
  tab
}

h <- train %>% group_by(pickup_latitude, pickup_longitude) %>% summarise(count = n()) %>% arrange(desc(count)) %>% head(50)

flow2 <- data.frame(fixtable(z = paste0(h$pickup_latitude), do = paste0(h$pickup_longitude)))


flow2 <- flow2[flow2[,3] > 0,]

nodes2 <- data.frame(name=unique(c(levels(factor(flow2[,1])), levels(factor(flow2[,2])))))
nam2 <- seq_along(nodes2[,1])-1
names(nam2) <- nodes2[,1]

links2 <- data.frame(source = nam2[as.character(flow2[,1])],
                                        target = nam2[as.character(flow2[,2])],
                                        value = flow2[,3])

sankeyNetwork(Links = links2, Nodes = nodes2,
                            Source = "source", Target = "target",
                            Value = "value", NodeID = "name",
                            fontFamily = "Arial", fontSize = 12, nodeWidth = 40,
                            colourScale = "d3.scale.category20()")

Those latitudes and longitudes having more than one connection either way may be duplicates and point to the same nearby destination, so be careful when feeding raw latitudes and longitudes to your machine learning algo, preprocess accordingly.

Note - The high density cluster of the pickups or dropoff are good for the company, because it lets the company to focus and concentrate their whole effort at a specific point, instead of dividing their concentration on several low density clusters of pickups and/or dropoffs, if they were far away from each other.

Suggestions

  1. Company must concentrate on increasing the ridership around the two airport belt.
  2. Company can start new schemes for the passengers going to or coming from the airport.
  3. Company can also award a new license to a new vendor which would cater to these schemes or to the passengers coming or leaving this belt

Now, we investigate the less frequent and low trip duration pickups and dropups. This will give us the idea of the pickups outside of the NY city.

setorder(train, trip_duration) 
g <- train[1:20,]
g$numb <- 1:20
leaflet(g) %>% addTiles() %>% addCircles(~pickup_longitude, ~pickup_latitude, color = "red", popup =~paste("pickup:", as.character(pickup_datetime), " numb: ", numb, "trip: ", trip_duration), radius = 30)%>% addCircles(~dropoff_longitude, ~dropoff_latitude, color = "blue", popup =~paste("dropoff:", as.character(dropoff_datetime), " numb: ", numb), radius = 30) %>% addLegend("bottomright",colors = c("red", "blue"), label= c("<b>pickup</b>", "<b>dropoff</b>"), title = "Distance between pickup and dropoff points of outliers", opacity = 0.3)

Can you spot the pickups? well, you can’t, because the pickups have been hidden by dropoffs. The pickup coordinates and dropoff coordinates are same. What this tells us is that these entries could be mistakes on the part of the driver. These entries could have been mistakenly entered by the driver.

Feature Idea

This could potentially cause another dimension of duplicacy, though i feel it wouldn’t be much of an issue. But after seeing this, the first potential feature that came into my mind was string distance, this feature can help us in capturig this effect of duplicacy.

I hope you would try this feature.

Network

setorder(train, pickup_datetime)

networkDF <- train[, .(pickup = first(pickup_point), low_trip = min(trip_duration), high_trip = max(trip_duration), min_passenger = min(passenger_count), max_passenger = max(passenger_count), early_year = min(pick_month), late_year = max(pick_month), early_month= min(pick_day), late_month = max(pick_day), early_wday = min(pick_wday), late_wday = max(pick_day), vendor = first(vendor_id)), by = dropoff_point] 

networkDF[,":="(drop_prev = lag(dropoff_point), drop_now = dropoff_point, vendor_now = vendor, vendor_prev = lag(vendor))]
#data frame for network


networkDF <- networkDF[, list(pickup, drop_prev, drop_now, low_trip, high_trip, min_passenger, max_passenger, early_year, late_year, early_month, late_month, early_wday, late_wday, vendor_now, vendor_prev)]

#network plotting function
visNetworkPerPickup <- function(df,seconds,passenger, mon, vendor, day){
    edges <- df %>% 
    filter(low_trip <= seconds) %>%
    filter(min_passenger <= passenger) %>%
    filter(early_month <= mon) %>% 
    filter(early_wday <= day) %>%
    filter(vendor_now == vendor) %>%
    select(pickup, drop_now) %>%
    rename(from = pickup) %>%
    rename(to = drop_now) %>%
    sample_n(30, replace = F) %>%
    ungroup() %>%
    mutate(arrows = "from")
  
  
  edgesMelt <- edges %>%
    mutate(shape = "") %>%
    melt(id = "shape", measure= c("to", "from"), value.name = "id")
  
  
  nodesPickup <- edgesMelt %>% 
    filter(variable == "from") %>%
    mutate(group = "pickup")
  
  
  nodesDrop <- edgesMelt %>%
    filter(variable == "to") %>%
    mutate(group = "drop")
  
  
  nodes <- rbind(nodesPickup, nodesDrop) %>% select(variable, id, group) %>% unique()
  
   visNetwork(nodes, edges, main = list(text = paste0("Pickup and drop off points "),
                                       style = "font-family:Comic Sans MS;color:#ff0000;font-size:15px;text-align:center;")) %>%
    visGroups( groupname = "pickup", color = "lightgreen") %>%
    visGroups( groupname = "drop", color = "lightblue") %>%
    visOptions(highlightNearest = list(enabled = TRUE, degree =1), nodesIdSelection = T) %>%
    visInteraction(dragNodes = T, dragView = T, zoomView = T)  %>%
    visGroups(groupname = "pickup", shape = "icon", icon = list(code = "f21d", size =100)) %>%
    visGroups(groupname = "drop", shape = "icon", icon = list(code = "f1ba", color = "green")) %>%
    addFontAwesome() %>%
    visInteraction(navigationButtons = TRUE) 
}

Let us now visualize the pick up and drop off coordinates of both vendor 1 and vendror 2 through network plot.

The network plots allow us to narrow down the number of actual interactive points by filtering it by the given different threshold for different variables. This allows us to easily focus and understand only a particualar aspect of the data on a given time.

we only look at some parts of the train dataset, if you want to analyse other parts you are free to download the code and try it in your local environment.

Trip<=200, passenger_count<=2, month<=4, day<=2, vendor_id==2

set.seed(1100)
visNetworkPerPickup(networkDF, 200, 2, 4, 2, 2)

Trip<=400, passenger_count<=2, month<=4, day<=2, vendor_id==1

networkDF <- train[, .(pickup = first(pickup_point), low_trip = min(trip_duration), high_trip = max(trip_duration), min_passenger = min(passenger_count), max_passenger = max(passenger_count), early_year = min(pick_month), late_year = max(pick_month), early_month= min(pick_day), late_month = max(pick_day), early_wday = min(pick_wday), late_wday = max(pick_day), vendor = first(vendor_id)), by = dropoff_point] 

networkDF[,":="(drop_prev = lag(dropoff_point), drop_now = dropoff_point, vendor_now = vendor, vendor_prev = lag(vendor))]
networkDF <- networkDF[, list(pickup, drop_prev, drop_now, low_trip, high_trip, min_passenger, max_passenger, early_year, late_year, early_month, late_month, early_wday, late_wday, vendor_now, vendor_prev)]

#network plotting function
visNetworkPerPickup <- function(df,seconds,passenger, mon, vendor, day){
    edges <- df %>% 
    filter(low_trip <= seconds) %>%
    filter(min_passenger <= passenger) %>%
    filter(early_month <= mon) %>% 
    filter(early_wday <= day) %>%
    filter(vendor_now == vendor) %>%
    select(pickup, drop_now) %>%
    rename(from = pickup) %>%
    rename(to = drop_now) %>%
    sample_n(30, replace = F) %>%
    ungroup() %>%
    mutate(arrows = "from")
  
  
  edgesMelt <- edges %>%
    mutate(shape = "") %>%
    melt(id = "shape", measure= c("to", "from"), value.name = "id")
  
  
  nodesPickup <- edgesMelt %>% 
    filter(variable == "from") %>%
    mutate(group = "pickup")
  
  
  nodesDrop <- edgesMelt %>%
    filter(variable == "to") %>%
    mutate(group = "drop")
  
  
  nodes <- rbind(nodesPickup, nodesDrop) %>% select(variable, id, group) %>% unique()
  
   visNetwork(nodes, edges, main = list(text = paste0("Pickup and drop off points "),
                                       style = "font-family:Comic Sans MS;color:#ff0000;font-size:15px;text-align:center;")) %>%
    visGroups( groupname = "pickup", color = "lightgreen") %>%
    visGroups( groupname = "drop", color = "lightblue") %>%
    visOptions(highlightNearest = list(enabled = TRUE, degree =1), nodesIdSelection = T) %>%
    visInteraction(dragNodes = T, dragView = T, zoomView = T)  %>%
    visGroups(groupname = "pickup", shape = "icon", icon = list(code = "f0ac", size =100)) %>%
    visGroups(groupname = "drop", shape = "icon", icon = list(code = "f162", color = "green")) %>%
    addFontAwesome() %>%
    visInteraction(navigationButtons = TRUE) 
}


set.seed(1300)
visNetworkPerPickup(networkDF, 200, 2, 4, 2, 1)

The pickups and dropoffs that you see on the network plots have been filtered and plotted on the network. These plots tell us that which points fall into a category, when arranged and filtered by different threshold combinations of different variables.

Uniqueness

Uniqueness, in here, means, on a given day or in a time interval, how many times each vendor picks up the passenger from a new destination? a destination which he didn’t visit before during that time interval.

Why is this important?

This is really important because this allows us to get the idea about what type of people the vendors or, if i may say, the company target. This will allow us to answer the 4th question.

Monthly

Vendor 1

h <- train[, .(trip = mean(trip_duration), uniqueness = length(unique(pickup_point))), by = list(pick_month, vendor_id)] 
h1 <- h %>% filter(vendor_id == 1)
h2 <- h %>% filter(vendor_id == 2)
hchart(h1, "treemap", hcaes(x = as.character(pick_month), value = uniqueness, color = uniqueness)) %>% hc_title(text = "Monthly mean trip Duration of both vendors") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Vendor 2

hchart(h2, "treemap", hcaes(x = as.character(pick_month), value = trip, color = uniqueness)) %>% hc_title(text = "Monthly mean trip Duration of both vendors") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Weekly

Vendor 1

h <- train[, .(trip = mean(trip_duration), uniqueness = length(unique(pickup_point))), by = list(pick_wday, vendor_id)] 
h1 <- h %>% filter(vendor_id == 1)
h2 <- h %>% filter(vendor_id == 2)
hchart(h1, "treemap", hcaes(x = as.character(pick_wday), value = trip, color = uniqueness)) %>% hc_title(text = "Weekly mean trip Duration of both vendors") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

vendor 2

hchart(h2, "treemap", hcaes(x = as.character(pick_wday), value = trip, color = uniqueness)) %>% hc_title(text = "Weekly mean trip Duration of both vendors") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Daily

vendor 1

h <- train[, .(trip = mean(trip_duration), uniqueness = length(unique(pickup_point))), by = list(pick_day, vendor_id)] 
h1 <- h %>% filter(vendor_id == 1)
h2 <- h %>% filter(vendor_id == 2)
hchart(h1, "treemap", hcaes(x = as.character(pick_day), value = trip, color = uniqueness)) %>% hc_title(text = "Daily mean trip Duration of both vendors with uniqueness") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

vendor 2

hchart(h2, "treemap", hcaes(x = as.character(pick_day), value = trip, color = uniqueness)) %>% hc_title(text = "Daily mean trip Duration of both vendors with uniqueness") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Hourly

Vendor 1

h <- train[, .(trip = mean(trip_duration), uniqueness = length(unique(pickup_point))), by = list(pick_hour, vendor_id)]
h1 <- h %>% filter(vendor_id == 1)
h2 <- h %>% filter(vendor_id == 2)
hchart(h1, "treemap", hcaes(x = as.character(pick_hour), value = trip, color = uniqueness)) %>% hc_title(text = "Hourly mean trip Duration of both vendors with uniqueness") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Vendor 2

hchart(h2, "treemap", hcaes(x = as.character(pick_hour), value = trip, color = uniqueness)) %>% hc_title(text = "Hourly mean trip Duration of both vendors with uniqueness") %>% hc_subtitle(text = "Vendor 2 is well ahead of Vendor 1 again")

Passengers

counts of passengers

k <- dataset[ , .(count = .N), by = passenger_count]
col <- viridis(11)
col <- substr(col, 0, 7)

col <- c("limegreen", "black", "red", "green", "blue", "brown", "yellow","violet", "magenta", "purple", "pink")

parts <- c(`0` = 60, `1` = 1476987, `2` = 300345, `3` = 85582, `4` = 40421, `5`= 111499, `6` = 68854, `7` = 3, `8` = 1, `9` = 1)
waffle(parts/10000, rows = 9, colors = col, legend_pos = "bottom", title = "Frequency of number of passengers", xlab= "1 square = 10000", pad =  2)

  • As exected, single passenger rides dominate the total overall rides.

  • Then comes two-passenger rides, and then surprisingly comes five-passenger rides.

The 1-passenger and 2-passenger rides followed directly by 5-passenger rides in terms of highest ridership numbers, is a sort of mystery to me. We’ll explore this information further.

Trip duration trend.

h <- train 
h[, above3 := ifelse(passenger_count >=3, "yes", "no")]
h[, date := as.Date(pickup_datetime)]
##                 id vendor_id     pickup_datetime    dropoff_datetime
##       1: id0190469         2 2016-01-01 00:00:17 2016-01-01 00:14:26
##       2: id1665586         1 2016-01-01 00:00:53 2016-01-01 00:22:27
##       3: id1210365         2 2016-01-01 00:01:01 2016-01-01 00:07:49
##       4: id3888279         1 2016-01-01 00:01:14 2016-01-01 00:05:54
##       5: id0924227         1 2016-01-01 00:01:20 2016-01-01 00:13:36
##      ---                                                            
## 1458640: id1255468         2 2016-06-30 23:58:52 2016-07-01 00:06:44
## 1458641: id2013516         1 2016-06-30 23:59:09 2016-07-01 00:11:43
## 1458642: id0376262         1 2016-06-30 23:59:10 2016-07-01 00:41:36
## 1458643: id2332349         2 2016-06-30 23:59:37 2016-07-01 00:23:39
## 1458644: id3719493         2 2016-06-30 23:59:39 2016-07-01 00:43:08
##          passenger_count pickup_longitude pickup_latitude
##       1:               5        -73.98174        40.71916
##       2:               1        -73.98508        40.74717
##       3:               5        -73.96528        40.80104
##       4:               1        -73.98229        40.75133
##       5:               1        -73.97011        40.75980
##      ---                                                 
## 1458640:               1        -73.98275        40.74529
## 1458641:               2        -73.95511        40.68956
## 1458642:               2        -73.87309        40.77410
## 1458643:               5        -73.99829        40.72262
## 1458644:               1        -73.97842        40.79158
##          dropoff_longitude dropoff_latitude store_and_fwd_flag
##       1:         -73.93883         40.82918                  N
##       2:         -73.95804         40.71749                  N
##       3:         -73.94748         40.81517                  N
##       4:         -73.99134         40.75034                  N
##       5:         -73.98936         40.74299                  N
##      ---                                                      
## 1458640:         -73.96900         40.75795                  N
## 1458641:         -73.97820         40.68580                  N
## 1458642:         -73.92670         40.85674                  N
## 1458643:         -73.97178         40.76247                  N
## 1458644:         -73.97316         40.67597                  N
##          trip_duration filter pick_year pick_month pick_day pick_wday
##       1:           849      0      2016          1        1         6
##       2:          1294      0      2016          1        1         6
##       3:           408      0      2016          1        1         6
##       4:           280      0      2016          1        1         6
##       5:           736      0      2016          1        1         6
##      ---                                                             
## 1458640:           472      0      2016          6       30         5
## 1458641:           754      0      2016          6       30         5
## 1458642:          2546      0      2016          6       30         5
## 1458643:          1442      0      2016          6       30         5
## 1458644:          2609      0      2016          6       30         5
##          pick_hour                       pickup_point
##       1:         0 40.7191581726074 -73.9817428588867
##       2:         0 40.7471656799316 -73.9850845336914
##       3:         0 40.8010406494141 -73.9652786254883
##       4:         0  40.7513313293457 -73.982292175293
##       5:         0 40.7597999572754 -73.9701080322266
##      ---                                             
## 1458640:        23  40.745288848877 -73.9827499389648
## 1458641:        23 40.6895637512207 -73.9551086425781
## 1458642:        23  40.774097442627 -73.8730926513672
## 1458643:        23  40.7226181030273 -73.998291015625
## 1458644:        23  40.791576385498 -73.9784164428711
##                               dropoff_point above3       date
##       1: 40.8291816711426 -73.9388275146484    yes 2016-01-01
##       2: 40.7174911499023 -73.9580383300781     no 2016-01-01
##       3: 40.8151702880859 -73.9474792480469    yes 2016-01-01
##       4:  40.7503395080566 -73.991340637207     no 2016-01-01
##       5: 40.7429885864258 -73.9893569946289     no 2016-01-01
##      ---                                                     
## 1458640: 40.7579498291016 -73.9690017700195     no 2016-06-30
## 1458641: 40.6858024597168 -73.9782028198242     no 2016-06-30
## 1458642: 40.8567390441895 -73.9267044067383     no 2016-06-30
## 1458643: 40.7624702453613 -73.9717788696289    yes 2016-06-30
## 1458644:  40.675968170166 -73.9731597900391     no 2016-06-30
h <- h[, .(mean_trip = mean(trip_duration)), by = list(above3, date)]

no <- h %>% filter(above3 == "no") %>% select(-above3) %>% rename(no = mean_trip)
yes <- h %>% filter(above3 == "yes")  %>% select(-above3) %>% rename(yes = mean_trip)

all <- merge(no, yes, by = "date", all.x = T, sort = F)

highchart() %>% hc_xAxis(categories = all$date) %>% hc_add_series(data = all$no, name = "below 3") %>% hc_add_series(data = all$yes, name = "3 or above") %>% hc_yAxis(title = list(text = "Trip duration(seconds)"), allowDecimals = FALSE)  %>%hc_title(text = "Mean trip durations") %>% hc_subtitle(text = "Some outliers in are visible in the plot") %>% hc_colors(c("#F0A1EA", "#76D7C4")) %>% hc_add_theme(hc_theme_chalk())
  • Two huge peaks on 5th jan and 13th feb seem as outlier.

  • These two dates could create problems in our machine learning process.

  • Rise of 5th Jan may suggest the effect of New Year season and 13th feb may indicate the effect of valentine’s day

  • As can be seen, a trip having more than 3 passengers has average trip duration higher than that of a trip having single passenger.

  • Ofcourse, this doesn’t tell us whether this fetches more money to the company or not, but this, still, gives us a hint of how much money a vendor would generate in comparison to a single passenger ride.

We filtered our dataset according to whether the passenger count is 3 or above, and 2 or less. The sudden jump in trip duration on feb 13 is due to the increase in the two-passenger rides(means a journey having atleast two passengers), this, i confirmed by filtering the train dataset by passenger_count of atleast 2 or above and the filtered datset contained all the data with passenger_count variable having minimum 2, and then i plotted the mean trip duration which didn’t show the 13th feb peak but the 5th jan peak still remained. I am quite sure, the majority of those 13th feb two-passenger rides were dominated by lovey-dovey couples.

Suggestions

  • Try to suggest the vendors to diversify their daily passenger ride serving, specifically on the dates that have less uniqueness as shown above.

  • Also try to give special concessions or start new schemes for New Year and and Valintine’s day, if the schemes don’t exist.

  • Also try to encourage the vendors to serve more than 3 passengers on a single ride, this may help in diversity of rides as well as money generation.

Distance

i <- cbind(longitude = train$pickup_longitude, latitude = train$pickup_latitude)
j <- cbind(longitude = train$dropoff_longitude, latitude= train$dropoff_latitude)

train[, distance := distHaversine(i, j)]
train %>% sample_frac(0.003)%>% plot_ly(x = ~log(trip_duration), y = ~log(distance), alpha = 0.3) %>% add_markers(marker = list(line = list(color = "black", width = 1))) %>%
  layout(
    title = "<b>Relation of journey time and distance</b>",
    xaxis = list(domain = c(0.1, 1), title = "<b><i>trip duration(log)</i></b>"),
    yaxis = list(title = "<b><i>distance(log)</i></b>"),
    updatemenus = list(
      list(
        y = 0.8,
        buttons = list(
          
          list(method = "restyle",
               args = list("type", "scatter"),
               label = "Scatter"),
          
          list(method = "restyle",
               args = list("type", "histogram2d"),
               label = "2D Histogram")))
))

Wow, this is quite an interesting plot. This plot tells us that log transformation of trip duration and distance has linear relation between them, and this would really help us in pur machine learning part.

Both x and y axis are in log scale, you can try to plot the points in original scale, you would find that the plot isn’t interpretable. This, also, tells us that we have quite a huge skewness in both the variables. Log transformation will surely increase the predictability of our machine learning process

Fitting loess

set.seed(78)

h <- train %>% sample_frac(0.002, replace =F)
h$trip_duration <- log(h$trip_duration)
h$distance <- log(h$distance+1)

m <- loess(trip_duration ~ distance, data = h)

h %>% plot_ly( x = ~distance, color = I("black")) %>%
  add_markers(y = ~trip_duration, text = paste("rowno: ", rownames(h)), showlegend = FALSE) %>%
  add_lines(y = ~fitted(loess(trip_duration~ distance)),
            line = list(color = '#07A4B5'),
            name = "Loess Smoother", showlegend = TRUE) %>%
  add_ribbons(data = augment(m),
              ymin = ~.fitted - 1.96 * .se.fit,
              ymax = ~.fitted + 1.96 * .se.fit,
              line = list(color = 'rgba(7, 164, 181, 0.05)'),
              fillcolor = 'rgba(7, 164, 181, 0.2)',
              name = "Standard Error") %>%
  layout(xaxis = list(title = '<i><b>distance</b></i>'),
         yaxis = list(title = '<i><b>trip duration<b></i>'),
         legend = list(x = 0.80, y = 0.90))

We can see that the loess smoother is passing right through the middle of the distribution of points, though we see quite a small standard error.

Cluster

I talked with you, earlier, about the importance of clustering the latitudes and longitudes. This is really essential for getting the correct number of destinations(centroids) that have these pickup and dropoff points clustered arround it. This will also be helpful in prediction of trip durations.

#clust <- leaderCluster(points = i, radius = 4, max_iter = 10, distance = #"haversine")

set.seed(76)
cluster <- kmeans(i, 50, nstart = 15)

train[, clusters := cluster$cluster]
cluster1_long <- train$pickup_longitude[train$clusters == 1] %>% head(100)
cluster1_lat <- train$pickup_latitude[train$clusters == 1]%>% head(100)

cluster2_long <- train$pickup_longitude[train$clusters == 2]%>% head(100)
cluster2_lat <- train$pickup_latitude[train$clusters == 2]%>% head(100)

cluster3_long <- train$pickup_longitude[train$clusters == 3]%>% head(100)
cluster3_lat <- train$pickup_latitude[train$clusters == 3]%>% head(100)

cluster4_long <- train$pickup_longitude[train$clusters == 4]%>% head(100)
cluster4_lat <- train$pickup_latitude[train$clusters == 4]%>% head(100)

cluster6_long <- train$pickup_longitude[train$clusters == 6]%>% head(100)
cluster6_lat <- train$pickup_latitude[train$clusters == 6]%>% head(100)


cluster1 = list(
  type = 'circle',
  xref ='x', yref='y',
  x0=min(cluster1_long), y0=min(cluster1_lat),
  x1=max(cluster1_long), y1=max(cluster1_lat),
  opacity=0.25,
  line = list(color="#835AF1"),
  fillcolor="#835AF1")

cluster2 = list(
  type = 'circle',
  xref ='x', yref='y',
  x0=min(cluster2_long), y0=min(cluster2_lat),
  x1=max(cluster2_long), y1=max(cluster2_lat),
  opacity=0.25,
  line = list(color="#1ABC9C"),
  fillcolor="#1ABC9C")

cluster3 = list(
  type = 'circle',
  xref ='x', yref='y',
  x0=min(cluster3_long), y0=min(cluster3_lat),
  x1=max(cluster3_long), y1=max(cluster3_lat),
  opacity=0.25,
  line = list(color="#F8C471"),
  fillcolor="#F8C471")


cluster4 = list(
  type = 'circle',
  xref ='x', yref='y',
  x0=min(cluster4_long), y0=min(cluster4_lat),
  x1=max(cluster4_long), y1=max(cluster4_lat),
  opacity=0.25,
  line = list(color="#F1948A"),
  fillcolor="#F1948A")


cluster6 = list(
  type = 'circle',
  xref ='x', yref='y',
  x0=min(cluster6_long), y0=min(cluster6_lat),
  x1=max(cluster6_long), y1=max(cluster6_lat),
  opacity=0.25,
  line = list(color="#3498DB"),
  fillcolor="#3498DB")


# updatemenus component
updatemenus <- list(
  list(
    active = -1,
    type = 'buttons',
    buttons = list(

      list(
        label = "None",
        method = "relayout",
        args = list(list(shapes = c()))),

      list(
        label = "Cluster 1",
        method = "relayout",
        args = list(list(shapes = list(cluster1, c(), c(), c(), c())))),

      list(
        label = "Cluster 2",
        method = "relayout",
        args = list(list(shapes = list(c(), cluster2, c(), c(), c())))),

      list(
        label = "Cluster 3",
        method = "relayout",
        args = list(list(shapes = list(c(), c(), cluster3, c(), c())))),
      
      list(
        label = "Cluster 4",
        method = "relayout",
        args = list(list(shapes = list(c(), c(), c(), cluster4, c())))),
      
      list(
        label = "Cluster 6",
        method = "relayout",
        args = list(list(shapes = list(c(), c(), c(), c(), cluster6)))),

      list(
        label = "All",
        method = "relayout",
        args = list(list(shapes = list(cluster1,cluster2,cluster3, cluster4,
                    cluster6))))
    )
  )
)

plot_ly(type = 'scatter', mode='markers') %>%
  add_trace(x=cluster1_long, y=cluster1_lat, mode='markers', marker=list(color='#835AF1')) %>%
  add_trace(x=cluster2_long, y=cluster2_lat, mode='markers', marker=list(color='#1ABC9C')) %>%
  add_trace(x=cluster3_long, y=cluster3_lat, mode='markers', marker=list(color='#F8C471')) %>%
    add_trace(x=cluster4_long, y=cluster4_lat, mode='markers', marker=list(color='#F1948A')) %>%
    add_trace(x=cluster6_long, y=cluster6_lat, mode='markers', marker=list(color='#3498DB')) %>%
  layout(title = "Highlight Pickup Clusters", showlegend = FALSE,
         updatemenus = updatemenus)

We have, only, plotted 100 points from each cluster. This lets us visualize the clusters more clearly without cluttering the plot with 1000+ points. We still need more efficiency to gauge actual number of pickup clusters. Cluster 3 has points that are well spread out and may contain some False Positives.

Future prospects

Though to predict the future trip duration, we will have to go throuh machine learning process, but we, still, can partly answer the 6th question by using the markov assumption.

d1 <- as.Date("2016-06-01")

d2 <- as.Date("2016-06-30")

last_30 <- train %>% mutate(date = as.Date(pickup_datetime)) %>% arrange(date) %>% filter(date %in% c(d1:d2)) %>% group_by(date) %>% mutate(center = mean(trip_duration))

lag_3 <- unique(last_30[, c("date","center")])
lag_3$id <- seq.int(nrow(lag_3))

accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}

df <- lag_3 %>% as.data.frame() %>%
  accumulate_by(~id)

 df %>%
  plot_ly(
    x = ~id, 
    y = ~center, 
    frame = ~frame,
    type = 'scatter', 
    mode = 'lines', 
    fill = 'tozeroy', 
    fillcolor='rgba(114, 186, 59, 0.5)',
    line = list(color = 'rgb(114, 186, 59)'),
    text = ~paste("Day: ", id, "<br>duration: s", center), 
    hoverinfo = 'text'
  ) %>%
  layout(
    title = "Median trip distribution of last 30 days",
    yaxis = list(
      title = "Duration(seconds)", 
      range = c(0,1400), 
      zeroline = F,
      tickprefix = "s"
    ),
    xaxis = list(
      title = "Day", 
      range = c(0,30), 
      zeroline = F, 
      showgrid = F
    )
  ) %>% 
  animation_opts(
    frame = 100, 
    transition = 0, 
    redraw = FALSE
  ) %>%
  animation_slider(
    currentvalue = list(
      prefix = "Day "
    )
  )

A lot of thing seems to be going in the above plot of last 30 days. We should really try to extract much of the above information by using lag of every date. Even if this doesn’t work, it surely deserves a try for the effect it has on the future predictions

The unaswered question

I told you at the start that i would answer the question that what is the actual reason that vendor 2 lags behind vendor 1 in every department. Well, here we go.

Well, do your remember the unexplained mystery of why 5-passenger rides had highest ride counts after 1 and 2 passenger rides. Actually, i am not going to answer that mystery, right now, but let’s explore that fact more.

After seeing this pie chart, i hope you get your answer?

#data with passenger_count only equal to 5
sub1 <- train[passenger_count == 5, ]
#data with passenger_count equal to any value other than 5
sub2 <- train[passenger_count != 5, ]

#plot of data having only 5 as passenger_count
hchart(as.character(sub1$vendor_id), type ="pie")

if not then compare the above one with this pie chart

#plot of data having passenger_count other than 5
hchart(as.character(sub2$vendor_id), type = "pie")

Yes, the difference of 8%, which we saw at the start, in the ridership numbers can be greatly explained by the difference between the 5-passenger rides done by both of these vendors. But then what about the different of trip duration between the vendors?

So, now,let us plot mean trip durations

sub1[, date := as.Date(pickup_datetime)]
##               id vendor_id     pickup_datetime    dropoff_datetime
##     1: id0190469         2 2016-01-01 00:00:17 2016-01-01 00:14:26
##     2: id1210365         2 2016-01-01 00:01:01 2016-01-01 00:07:49
##     3: id3205616         2 2016-01-01 00:13:22 2016-01-01 00:38:42
##     4: id1599467         2 2016-01-01 00:19:26 2016-01-01 00:29:39
##     5: id3764623         2 2016-01-01 00:20:15 2016-01-01 00:28:19
##    ---                                                            
## 78084: id2142849         2 2016-06-30 23:35:46 2016-06-30 23:48:27
## 78085: id3864384         2 2016-06-30 23:44:14 2016-07-01 00:46:18
## 78086: id3006845         2 2016-06-30 23:53:15 2016-07-01 00:05:35
## 78087: id2722409         2 2016-06-30 23:56:56 2016-07-01 00:21:32
## 78088: id2332349         2 2016-06-30 23:59:37 2016-07-01 00:23:39
##        passenger_count pickup_longitude pickup_latitude dropoff_longitude
##     1:               5        -73.98174        40.71916         -73.93883
##     2:               5        -73.96528        40.80104         -73.94748
##     3:               5        -73.99238        40.72498         -73.97575
##     4:               5        -73.99803        40.71976         -73.99404
##     5:               5        -73.97171        40.75943         -73.95811
##    ---                                                                   
## 78084:               5        -73.97139        40.76440         -73.99781
## 78085:               5        -73.78207        40.64465         -73.98022
## 78086:               5        -73.99475        40.72304         -74.01648
## 78087:               5        -73.96920        40.78517         -73.94654
## 78088:               5        -73.99829        40.72262         -73.97178
##        dropoff_latitude store_and_fwd_flag trip_duration filter pick_year
##     1:         40.82918                  N           849      0      2016
##     2:         40.81517                  N           408      0      2016
##     3:         40.77958                  N          1520      0      2016
##     4:         40.69547                  N           613      0      2016
##     5:         40.78188                  N           484      0      2016
##    ---                                                                   
## 78084:         40.74030                  N           761      0      2016
## 78085:         40.72390                  N          3724      0      2016
## 78086:         40.70587                  N           740      0      2016
## 78087:         40.71431                  N          1476      0      2016
## 78088:         40.76247                  N          1442      0      2016
##        pick_month pick_day pick_wday pick_hour
##     1:          1        1         6         0
##     2:          1        1         6         0
##     3:          1        1         6         0
##     4:          1        1         6         0
##     5:          1        1         6         0
##    ---                                        
## 78084:          6       30         5        23
## 78085:          6       30         5        23
## 78086:          6       30         5        23
## 78087:          6       30         5        23
## 78088:          6       30         5        23
##                              pickup_point
##     1: 40.7191581726074 -73.9817428588867
##     2: 40.8010406494141 -73.9652786254883
##     3:  40.724983215332 -73.9923782348633
##     4: 40.7197647094727 -73.9980316162109
##     5: 40.7594337463379 -73.9717102050781
##    ---                                   
## 78084: 40.7644004821777 -73.9713897705078
## 78085:  40.644645690918 -73.7820739746094
## 78086: 40.7230415344238 -73.9947509765625
## 78087: 40.7851715087891 -73.9692001342773
## 78088:  40.7226181030273 -73.998291015625
##                             dropoff_point above3       date  distance
##     1: 40.8291816711426 -73.9388275146484    yes 2016-01-01 12770.911
##     2: 40.8151702880859 -73.9474792480469    yes 2016-01-01  2173.304
##     3: 40.7795753479004 -73.9757537841797    yes 2016-01-01  6236.775
##     4: 40.6954727172852 -73.9940414428711    yes 2016-01-01  2725.055
##     5: 40.7818832397461 -73.9581069946289    yes 2016-01-01  2749.644
##    ---                                                               
## 78084: 40.7402992248535 -73.9978103637695    yes 2016-06-30  3487.438
## 78085:  40.7239036560059 -73.980224609375    yes 2016-06-30 18911.196
## 78086: 40.7058715820312 -74.0164794921875    yes 2016-06-30  2648.500
## 78087: 40.7143096923828 -73.9465408325195    yes 2016-06-30  8116.456
## 78088: 40.7624702453613 -73.9717788696289    yes 2016-06-30  4967.990
##        clusters
##     1:       19
##     2:       33
##     3:       24
##     4:        8
##     5:       38
##    ---         
## 78084:       16
## 78085:       35
## 78086:       24
## 78087:       15
## 78088:       24
sub2[, date:= as.Date(pickup_datetime)] 
##                 id vendor_id     pickup_datetime    dropoff_datetime
##       1: id1665586         1 2016-01-01 00:00:53 2016-01-01 00:22:27
##       2: id3888279         1 2016-01-01 00:01:14 2016-01-01 00:05:54
##       3: id0924227         1 2016-01-01 00:01:20 2016-01-01 00:13:36
##       4: id2294362         2 2016-01-01 00:01:33 2016-01-01 00:13:25
##       5: id1078247         2 2016-01-01 00:01:37 2016-01-01 00:03:31
##      ---                                                            
## 1380552: id3952659         2 2016-06-30 23:58:47 2016-07-01 00:12:56
## 1380553: id1255468         2 2016-06-30 23:58:52 2016-07-01 00:06:44
## 1380554: id2013516         1 2016-06-30 23:59:09 2016-07-01 00:11:43
## 1380555: id0376262         1 2016-06-30 23:59:10 2016-07-01 00:41:36
## 1380556: id3719493         2 2016-06-30 23:59:39 2016-07-01 00:43:08
##          passenger_count pickup_longitude pickup_latitude
##       1:               1        -73.98508        40.74717
##       2:               1        -73.98229        40.75133
##       3:               1        -73.97011        40.75980
##       4:               1        -73.98499        40.77389
##       5:               1        -73.97334        40.76407
##      ---                                                 
## 1380552:               2        -73.99140        40.75012
## 1380553:               1        -73.98275        40.74529
## 1380554:               2        -73.95511        40.68956
## 1380555:               2        -73.87309        40.77410
## 1380556:               1        -73.97842        40.79158
##          dropoff_longitude dropoff_latitude store_and_fwd_flag
##       1:         -73.95804         40.71749                  N
##       2:         -73.99134         40.75034                  N
##       3:         -73.98936         40.74299                  N
##       4:         -73.93649         40.84777                  N
##       5:         -73.97485         40.76173                  N
##      ---                                                      
## 1380552:         -73.98929         40.77358                  N
## 1380553:         -73.96900         40.75795                  N
## 1380554:         -73.97820         40.68580                  N
## 1380555:         -73.92670         40.85674                  N
## 1380556:         -73.97316         40.67597                  N
##          trip_duration filter pick_year pick_month pick_day pick_wday
##       1:          1294      0      2016          1        1         6
##       2:           280      0      2016          1        1         6
##       3:           736      0      2016          1        1         6
##       4:           712      0      2016          1        1         6
##       5:           114      0      2016          1        1         6
##      ---                                                             
## 1380552:           849      0      2016          6       30         5
## 1380553:           472      0      2016          6       30         5
## 1380554:           754      0      2016          6       30         5
## 1380555:          2546      0      2016          6       30         5
## 1380556:          2609      0      2016          6       30         5
##          pick_hour                       pickup_point
##       1:         0 40.7471656799316 -73.9850845336914
##       2:         0  40.7513313293457 -73.982292175293
##       3:         0 40.7597999572754 -73.9701080322266
##       4:         0  40.7738914489746 -73.984992980957
##       5:         0 40.7640724182129 -73.9733352661133
##      ---                                             
## 1380552:        23 40.7501220703125 -73.9914016723633
## 1380553:        23  40.745288848877 -73.9827499389648
## 1380554:        23 40.6895637512207 -73.9551086425781
## 1380555:        23  40.774097442627 -73.8730926513672
## 1380556:        23  40.791576385498 -73.9784164428711
##                               dropoff_point above3       date   distance
##       1: 40.7174911499023 -73.9580383300781     no 2016-01-01  4014.6236
##       2:  40.7503395080566 -73.991340637207     no 2016-01-01   771.0093
##       3: 40.7429885864258 -73.9893569946289     no 2016-01-01  2477.3474
##       4:  40.847770690918 -73.9364929199219     no 2016-01-01  9183.4479
##       5:  40.7617340087891 -73.974853515625     no 2016-01-01   290.0838
##      ---                                                                
## 1380552: 40.7735786437988 -73.9892883300781     no 2016-06-30  2617.2467
## 1380553: 40.7579498291016 -73.9690017700195     no 2016-06-30  1824.9950
## 1380554: 40.6858024597168 -73.9782028198242     no 2016-06-30  1993.8556
## 1380555: 40.8567390441895 -73.9267044067383     no 2016-06-30 10248.5996
## 1380556:  40.675968170166 -73.9731597900391     no 2016-06-30 12877.0841
##          clusters
##       1:       48
##       2:       48
##       3:       38
##       4:       27
##       5:       16
##      ---         
## 1380552:        5
## 1380553:       50
## 1380554:       39
## 1380555:       28
## 1380556:       15
new1 <- sub1[, .(mean_trip = mean(trip_duration)), by =  date]
new2 <- sub2[, .(mean_trip = mean(trip_duration)), by = date]

all <- left_join(new1, new2, by = "date")

setnames(all, c("mean_trip.x", "mean_trip.y"), c("y1", "y2"))

highchart() %>% hc_chart(zoomType = "xy") %>% hc_xAxis(data= all$date) %>% hc_add_series(data = all$y1, type = "line", name = "5 passenger") %>% hc_add_series(data = all$y2, type = "line", name = "not 5 passenger") 

The mean trip duration of 5-passenger rides are on the higher side as opposed to the rides that don’t have 5 passegers. The higher trip durations of vendor 2 can be greatly explained by this effect to.

Animation

This plot gives us the changing scenario of pickups and dropoffs, of both vendors together, over time, especially during months.

set.seed(23)
train %>% sample_frac(0.003, replace = F) %>%
   plot_ly(
     x = ~pickup_latitude, 
     y = ~pickup_longitude, 
     size = ~trip_duration, 
     color = ~as.factor(vendor_id), 
     frame = ~pick_month, 
     text = ~pickup_point, 
     hoverinfo = "text",
     type = 'scatter',
     mode = 'markers', 
     colors = "Set1"
   ) %>%
   animation_opts(
     1000, easing = "elastic", redraw = FALSE
   ) %>% 
   animation_button(
     x = 1, xanchor = "right", y = 0, yanchor = "bottom"
   ) %>%
   animation_slider(
     currentvalue = list(prefix = "MONTH ", font = list(color="red"))
   )

The visualisations and necessary information conveyed to our employer, i hope he/she will be satisfied, as for the sixth question, we solve it by doing machine learning process, and this will done in the machine learning tutorial.

Note- This exploration isn’t exhaustive and more new findings and visualisation will be added in the further. Hope this exploration helped you in some way, please upvote, if you found this notebook useful.

Also, i would look into the cluster plot and why it isn’t being displayed properly.